Introduction

Background and motivation

Human resources are the most valuable asset in any country¹. They are the main reason behind the success or the failure of any organization. In fact, having an educated and competent manpower is the key driver to economic and social development. In this context, the importance of academic education has become undeniable. Therefore, it is crucial to invest money and time in order to study students’ academic performance and figure out effective ways to improve it.

Given the importance of the topic, it has been given particular attention in past research. In fact, many studies have been conducted in order to analyze the factors impacting students’ academic performance. While some studies focused on the psychological variables, such as Franck Amadieu & André Tricot’s research², other researchers have been interested in the impact of other elements such as mobility ³, gender and other socio-economic factors on students’ academic success.

Many reasons motivated us to choose this topic of research. In fact, as students, we are very passionate about the educational field. Thus, we want to provide through this project a detailed analysis that can be used as a reference guide for leaders working in the educational field. Mainly, we want to help schools and universities to have a better understanding of the factors influencing students’ academic performance in order to improve their decision-making processes, students’ success rate and eventually their overall organization.

Source: ¹ Gestion des ressources humaines,Jean-Marie Peretti, 2004. ² Psychological factors which have an effect on student success ,2015. ³ La migration pour études : Regards d’intervenants sur l’accueil et l’intégration des nouveaux étudiants »,2009.

Project objectives

The aim of the project is to understand the evolution of secondary academic performance in France. Our study will mainly focus on 3rd grade students (equivalent to 11th grade in Switzerland) and their results on the Diplôme National du Brevet (DNB) by school.

First, we will observe whether there are improvements or, on the contrary, deterioration in admissions of DNB over the years. From this dataset, we will also make comparisons, particularly at the geographical level, and an analysis of the success rate in terms of distinction for each school.

Then, we will try to understand if there is a correlation between academic success and some socio-economic factors, such as the type of accommodation, the single-parent families rate, and the involvement of schools in students’ physical and sports practice. Finally, despite these factors, we will investigate whether the COVID-19 pandemic has had a direct negative impact on students’ school performance.

Research questions

  • What is the evolution of student performance over time and across the different regions/departments of France?

  • Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success ?

  • Has the COVID-19 pandemic impacted student performance?

Data

Import data

Data Wrangling

We have realised that some wrangling are necessary for each data sets. We have established a checklist that we will go through for each data set. We have to :

  1. Translate the column names.
  2. Make sure that all data are of the right type.
  3. Make sure that the time reference (year) are all aligned with the exam session.
  4. Add a column department_fr that with department names matching the ones in ggplot.
  5. Verify if the data set needs further wrangling.

Function

As we will rename all data sets in the same way, we found it useful to create a function. The function has a data frame and a vector as inputs. It checks if the length of the vector is correct, if so it returns a tibble with the column names renamed


rename_df <- function(df, x){
  if (ncol(df) == length(x)){
    names(df) <- c(x)
    df <- as_tibble(df)
  } else {
    stop("Vector is not the right length")
  }
}

dnb_results

This dataset presents the results of the diplôme national du brevet by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.

  1. Translate the column names.

dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)
  1. success_rate is of the form ,% we want it as a double of the form .

dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
                                               gsub(",",".", dnb_results[["success_rate_pct"]])))
  1. We need to harmonize the year variables of all the other data sets to match the logic of this one. The year is the year of the exam session (e.g academic period “2020-2021” is represented as 2021)

  2. We need to add a column department_fr that with department names matching the ones in ggplot.


dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
  1. We want some to know the attribution rate of each mention to simplify the comparison.
dnb_results <- dnb_results %>% 
  mutate(without_pct = admitted_without/admitted*100,
         AB_pct = admitted_AB/admitted*100,
         B_pct = admitted_B/admitted*100,
         TB_pct = admitted_TB/admitted*100
         )

We can see the final table dnb_results below.

establishment_24

This data set gathers all schools which have been awarded the “Generation 2024” label. Its aim is to develop bridges between the school world and the sports movement in order to encourage young people to practice physical activity and sport. This data set contains 6’883 observations.

  1. Translate the column names.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
  1. No problem for this data set
  2. We need to add two variables session_started and session_ended. Indeed as the label has a start and an End date we have to trace the first session and the last session where the establishment have the label generation 2024.Most labellisations start and end in January but a few start and end in middle of the year. Exams take place end of June, beginning of July. Therefore, we will consider labellisation done in August and after as done for the next academic year.

establishment_24 <- establishment_24 %>% 
  mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
                                     month(label_start_date) >  7 ~ year(label_start_date)+1),
         session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
                                   month(label_end_date) >  7 ~ year(label_end_date)+1)
         )
  1. We need to add a column department_fr that with department names matching the ones in ggplot.

establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
  1. We can create a high_school_type variable instead of agricultural_high_school, military_high_school, vocational_high_school and drop those variables.

establishment_24 <- establishment_24 %>% 
  mutate(high_school_type = case_when(agricultural_high_school == 1 ~ "agricultural high school",
                                      military_high_school == 1 ~ "military high school",
                                      vocational_high_school == 1 ~ "vocational high school")) %>% 
    select(region:sport_section, high_school_type, establishment_web:session_ended)

We can see the final table establishment_24 below.

student_housing

This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.

  1. Translate the column names.

housing_names <- c("year_back_to_school", "Academic_region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)
  1. No Problem for this data set

  2. We need to create a session variable as year_back_to_school refers to the beginning of the school year and not the exam session.


student_housing <- student_housing %>% 
  mutate(session = year_back_to_school + 1) %>% 
    select(year_back_to_school,session, everything()) #here just to order variables
  1. We need to add a column department_fr that with department names matching the ones in ggplot.
student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
  str_to_title(.) %>% 
  gsub("Du", "du", .) %>% 
  gsub("De", "de", .) %>% 
  gsub("D'", "D", .) %>%
  gsub("Et", "et", .) %>%
  gsub(" ", "-", .) %>%
  str_replace_all("Corse-du-Sud", "Corse du Sud") %>% 
  str_replace_all("deux-Sevres", "Deux-Sevres") %>% 
  str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%       
  str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>% 
  str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
  1. No need for further data wrangling for this data set

We can see the final table student_housing below.

single_parent

This data set provides information about the single-parent families in each municipality. The census has been made every five years since 2008. This data set contains 104’986 observations

  1. Translate the column names.
sg_parent_names <- c("geocode", "municipality", "year","sing_par")

covid_in_schools

This is a time based data set that gives us information on the COVID tests and results carried out by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily. On the 11th October, the data set contained 543’974 observations.

  1. Translate the column names.

    can try to implement API (further idea)

#1
covide_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")
covid_in_schools <- rename_df(Covid_sp_dep_7j_cage_scol_2022_10_10_19h02,covide_names)
  1. test_week will be treated in (3.). positive, incidence_rate and positivity_rate need to be doubles
covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))
  1. We need to create two new variables. The first will be the date categorizing each week. We chose the first date (Monday). The test for a session will be those from August to July of the next year. As our argument will be set on the month, we might have some test done the first days of august count towards the “wrong” session. The number of Covid cases in August are relatively low compared to the rest of the year and it represents at maximum 6 days of tests. Therefore we consider this margin of error to be satisfactory.
covid_in_schools <- covid_in_schools %>% 
  mutate(test_date = ymd(substr(test_week,1,10)),
         session = case_when(month(test_date) <= 7 ~ year(test_date),
                             month(test_date) >  7 ~ year(test_date)+1))
  1. Only department code. We need to input the names of the departments.

  2. To simplify the dataset, we need to drop all but 11-15 educational_level.

We can see the final table covid_in_schools below.

Auxiliary data sets

We will use the ggplot France map for our visualizations

map <- map_data("france")

The region variable is in fact the departments. A quick renaming is necessary.

colnames(map)[5]<- "department_fr"
  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data (could be part of EDA too)
  • Listing anomalies and outliers (could be part of EDA too)

Exploratory data analysis

Essai

We are mostly exploring our options and have yet to truly deep dive into this portion of our project.

Box plot Analysis per department

success_rate_pct

without_pct

B_pct

AB_pct

TB_pct

Box plot Analysis of highest performing department

Paris 2020

Guyane 2006


ggplot(dnb_results)+
  geom_bar(mapping = aes(x = registered, fill = department))

Number of admitted by region


dnb_results %>% 
 select(admitted, region) %>% 
 group_by(region) %>% 
 summarise(sum = sum(admitted)) %>% 
 ggplot(aes(x = region, y = sum, fill = region)) + 
 geom_col() +
 theme(axis.title.x=element_blank(),
       axis.text.x=element_blank(),
       axis.ticks.x=element_blank()
       )

Average TB percentage by region


dnb_results %>% 
 select(TB_pct, region) %>% 
 group_by(region) %>% 
 summarise(mean = mean(TB_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = region, y = mean, fill = region)) + 
 geom_col() +
 theme(axis.title.x=element_blank(),
       axis.text.x=element_blank(),
       axis.ticks.x=element_blank()
       )

Average AB percentage by region


dnb_results %>% 
 select(AB_pct, region) %>% 
 group_by(region) %>% 
 summarise(mean = mean(AB_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = region, y = mean, fill = region)) + 
 geom_col() +
 theme(axis.title.x=element_blank(),
       axis.text.x=element_blank(),
       axis.ticks.x=element_blank()
       )

Average B percentage by region


dnb_results %>% 
 select(B_pct, region) %>% 
 group_by(region) %>% 
 summarise(mean = mean(B_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = region, y = mean, fill = region)) + 
 geom_col() +
 theme(axis.title.x=element_blank(),
       axis.text.x=element_blank(),
       axis.ticks.x=element_blank()
       )

Average percentage of admitted without mention students by region


dnb_results %>% 
 select(without_pct, region) %>% 
 group_by(region) %>% 
 summarise(mean = mean(without_pct, na.rm = TRUE)) %>% 
 ggplot(aes(x = region, y = mean, fill = region)) + 
 geom_col() +
 theme(axis.title.x=element_blank(),
       axis.text.x=element_blank(),
       axis.ticks.x=element_blank()
       )

Average success rate by department


dnb_results %>% 
  select(success_rate_pct, region, department) %>% 
  group_by(department) %>% 
  summarise(mean = mean(success_rate_pct, na.rm = TRUE)) %>% 
  ggplot(aes(x = department, y = mean, fill = department)) + 
  geom_col() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()
        )

regression Essai

lm1 <- lm(dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + dnb_results$AB_pct + dnb_results$without_pct)
summary(lm1)
#> 
#> Call:
#> lm(formula = dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + 
#>     dnb_results$AB_pct + dnb_results$without_pct)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -2.30e-09  0.00e+00  0.00e+00  0.00e+00  1.53e-10 
#> 
#> Coefficients:
#>                          Estimate Std. Error   t value Pr(>|t|)    
#> (Intercept)              1.00e+02   1.53e-13  6.54e+14   <2e-16 ***
#> dnb_results$without_pct -1.00e+00   1.63e-15 -6.12e+14   <2e-16 ***
#> dnb_results$B_pct       -1.00e+00   2.93e-15 -3.41e+14   <2e-16 ***
#> dnb_results$AB_pct      -1.00e+00   1.96e-15 -5.11e+14   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.25e-12 on 139537 degrees of freedom
#>   (39 observations deleted due to missingness)
#> Multiple R-squared:     1,   Adjusted R-squared:     1 
#> F-statistic: 1.79e+29 on 3 and 139537 DF,  p-value: <2e-16

cluster Essai

dnb_pct_dep <- dnb_results %>%
  group_by(department, session) %>% 
  summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
            B_pct_dep = mean(B_pct, na.rm = TRUE),
            TB_pct_dep = mean(TB_pct, na.rm = TRUE),
            without_pct_dep = mean(without_pct, na.rm = TRUE),
            success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))

pairs(dnb_pct_dep[2:6])

distance <- dist(dnb_pct_dep)
#> Warning in dist(dnb_pct_dep): NAs introduced by coercion

mydata.hclust <- hclust(distance)
plot(mydata.hclust)

esssai carte

Creation of the map theme


map_theme <- theme(title=element_text(),
                   plot.title=element_text(margin=margin(20,20,20,20), size=18, hjust = 0.5),
                   axis.text.x=element_blank(),
                   axis.text.y=element_blank(),
                   axis.ticks=element_blank(),
                   axis.title.x=element_blank(),
                   axis.title.y=element_blank(),
                   panel.grid.major= element_blank(), 
                   panel.background= element_blank()) 

Creation of the dataset used for the map


result <- dnb_results %>% 
  select(department_fr, success_rate_pct) %>% 
  group_by(department_fr) %>% 
  summarise(mean = mean(success_rate_pct))

Join the map from ggplot and our new dataset


result_map <- left_join(x = map[,-6], y = result)

plot


ggplot(result_map, aes(long,lat, group = group, fill = mean)) +
  geom_polygon() +
  coord_map() +
  scale_fill_gradient(name = "Average sucess  rate") +
  labs(x = "", 
       y = "", 
       title = "Average success rate, xxxx-2021") +
  map_theme  
#for other examples to change colours https://colinfay.me/mapping-the-french-second-round-results-with-r/

  • Mapping out the underlying structure
  • Identifying the most important variables
  • Univariate visualizations
  • Multivariate visualizations
  • Summary tables

Analysis

Conclusion